This report is for Team A, which includes:
The United States forest Service (UFS) considers itself a multi-faceted agency. Cumulatively, the agency manages and protects 154 national forests and 20 grasslands in 43 states and Puerto Rico including approximately 500 million acres of private, state and tribal forests, on which the UFS promotes sustainable management. The UFS mission is to sustain the health, diversity, and productivity of the nation’s forests and grasslands to meet the needs of present and future generations. Some highlights of the UFS responsibility are shown below:
o 13 billion dollars contributed to the U.S. economy by visitor spending each year o 193 million acres managed by the Forest Service o 27 million annual visits to ski areas on national forests o 7.2 million acres of wetlands o 36.6 million acres of wilderness o 400,000 acres of lakes o 57,000 miles of streams o 10,000 professional wildland firefighters o 154 national forests o 20 percent of America’s clean water supply provided by the national forests and grasslands Source: https://www.fs.fed.us/about-agency/newsroom/by-the-numbers
Like any managerial process, the UFS’s success is measured in terms of how well it maximizes its use of resources (time, people, and money) to deliver value to their customers, in this case the US taxpayer/government. To maximize their impact, the UFS has had success creating partnerships with public and private agencies that help the UFS plant trees, improve trails, educate the public, and promote sustainable forest management and biodiversity conservation domestically and internationally.
Given the large area, various climates and terrains, diverse plants and forest species, and wildlife that make surveying the locations on any kind of frequent basis, the Forest Gladiator team has been tasked with developing a predictive model(s) using existing data to improve the UFS’s management capabilities involving forest cover areas.
There is a need to predict the forest cover for 30 x 30 meter cells using cartographic variables obtained from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. The objective is to predict cover type as a multiclass classification problem based on the associated attributes (features).
In our raw data it is clear that we don’t have an even distribuion of the target variable (cover types). The data set is over 85% Spruce Fir and Lodgepole Pine. Another nearly 10% are Ponerosa Pine and Krummholz. This needs to be taken into account as we split the data for training and testing. It also gives us a good benchmark for our models as we check the outputs on the test data.
From a quick skim of the data we see it is already pretty clean. There are no missing data and nothing jumps out as obviously out of the ordinary. There are clear groups of variables that we will explore for patterns such as soil type, distance measures, shade at different times of day, and wilderness area.
set.seed(10)
train_index <- caret::createDataPartition(
y = raw_data$cover_type,
p = 0.7,
times = 1,
list = FALSE
)
raw_train <- raw_data[train_index,]
raw_test <- raw_data[-train_index,]
skim(raw_train) %>% pander() Skim summary statistics
n obs: 406710
n variables: 55
| variable | missing | complete | n | n_unique |
|---|---|---|---|---|
| cover_type | 0 | 406710 | 406710 | 7 |
| top_counts | ordered |
|---|---|
| lod: 198311, spr: 148288, pon: 25028, kru: 14357 | FALSE |
| variable | missing | complete | n | mean |
|---|---|---|---|---|
| aspect | 0 | 406710 | 406710 | 155.52 |
| elevation | 0 | 406710 | 406710 | 2959.23 |
| hillshade_3pm | 0 | 406710 | 406710 | 142.49 |
| hillshade_9am | 0 | 406710 | 406710 | 212.16 |
| hillshade_noon | 0 | 406710 | 406710 | 223.31 |
| horizontal_distance_to_fire_points | 0 | 406710 | 406710 | 1981.43 |
| horizontal_distance_to_hydrology | 0 | 406710 | 406710 | 269.17 |
| horizontal_distance_to_roadways | 0 | 406710 | 406710 | 2349.75 |
| slope | 0 | 406710 | 406710 | 14.11 |
| soil_type1 | 0 | 406710 | 406710 | 0.0052 |
| soil_type10 | 0 | 406710 | 406710 | 0.056 |
| soil_type11 | 0 | 406710 | 406710 | 0.021 |
| soil_type12 | 0 | 406710 | 406710 | 0.052 |
| soil_type13 | 0 | 406710 | 406710 | 0.03 |
| soil_type14 | 0 | 406710 | 406710 | 0.001 |
| soil_type15 | 0 | 406710 | 406710 | 7.4e-06 |
| soil_type16 | 0 | 406710 | 406710 | 0.0049 |
| soil_type17 | 0 | 406710 | 406710 | 0.0059 |
| soil_type18 | 0 | 406710 | 406710 | 0.0033 |
| soil_type19 | 0 | 406710 | 406710 | 0.007 |
| soil_type2 | 0 | 406710 | 406710 | 0.013 |
| soil_type20 | 0 | 406710 | 406710 | 0.016 |
| soil_type21 | 0 | 406710 | 406710 | 0.0014 |
| soil_type22 | 0 | 406710 | 406710 | 0.058 |
| soil_type23 | 0 | 406710 | 406710 | 0.099 |
| soil_type24 | 0 | 406710 | 406710 | 0.036 |
| soil_type25 | 0 | 406710 | 406710 | 0.00083 |
| soil_type26 | 0 | 406710 | 406710 | 0.0044 |
| soil_type27 | 0 | 406710 | 406710 | 0.0019 |
| soil_type28 | 0 | 406710 | 406710 | 0.0016 |
| soil_type29 | 0 | 406710 | 406710 | 0.2 |
| soil_type3 | 0 | 406710 | 406710 | 0.0083 |
| soil_type30 | 0 | 406710 | 406710 | 0.052 |
| soil_type31 | 0 | 406710 | 406710 | 0.044 |
| soil_type32 | 0 | 406710 | 406710 | 0.09 |
| soil_type33 | 0 | 406710 | 406710 | 0.077 |
| soil_type34 | 0 | 406710 | 406710 | 0.0027 |
| soil_type35 | 0 | 406710 | 406710 | 0.0033 |
| soil_type36 | 0 | 406710 | 406710 | 0.00017 |
| soil_type37 | 0 | 406710 | 406710 | 0.00052 |
| soil_type38 | 0 | 406710 | 406710 | 0.027 |
| soil_type39 | 0 | 406710 | 406710 | 0.024 |
| soil_type4 | 0 | 406710 | 406710 | 0.021 |
| soil_type40 | 0 | 406710 | 406710 | 0.015 |
| soil_type5 | 0 | 406710 | 406710 | 0.0028 |
| soil_type6 | 0 | 406710 | 406710 | 0.011 |
| soil_type7 | 0 | 406710 | 406710 | 0.00017 |
| soil_type8 | 0 | 406710 | 406710 | 0.00031 |
| soil_type9 | 0 | 406710 | 406710 | 0.0021 |
| vertical_distance_to_hydrology | 0 | 406710 | 406710 | 46.39 |
| wilderness_area1 | 0 | 406710 | 406710 | 0.45 |
| wilderness_area2 | 0 | 406710 | 406710 | 0.052 |
| wilderness_area3 | 0 | 406710 | 406710 | 0.44 |
| wilderness_area4 | 0 | 406710 | 406710 | 0.064 |
| sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|
| 111.9 | 0 | 58 | 127 | 260 | 360 | ▇▇▆▃▃▃▃▆ |
| 280.02 | 1859 | 2809 | 2995 | 3163 | 3858 | ▁▁▂▃▇▆▁▁ |
| 38.3 | 0 | 119 | 143 | 168 | 254 | ▁▁▂▆▇▆▂▁ |
| 26.76 | 0 | 198 | 218 | 231 | 254 | ▁▁▁▁▁▂▇▇ |
| 19.78 | 0 | 213 | 226 | 237 | 254 | ▁▁▁▁▁▁▅▇ |
| 1324.59 | 0 | 1024 | 1711 | 2551 | 7173 | ▅▇▆▂▁▁▁▁ |
| 212.33 | 0 | 108 | 218 | 384 | 1397 | ▇▆▃▂▁▁▁▁ |
| 1558.92 | 0 | 1103 | 1998 | 3326 | 7117 | ▆▇▆▅▃▂▂▁ |
| 7.5 | 0 | 9 | 13 | 18 | 66 | ▅▇▅▂▁▁▁▁ |
| 0.072 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.23 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.14 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.22 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.17 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.032 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.0027 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.07 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.077 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.057 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.083 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.11 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.13 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.037 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.23 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.3 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.19 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.029 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.066 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.043 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.041 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.4 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▂ |
| 0.091 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.22 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.21 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.29 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.27 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.052 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.057 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.013 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.023 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.16 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.15 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.14 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.12 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.052 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.11 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.013 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.018 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.045 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 58.33 | -166 | 7 | 29 | 69 | 601 | ▁▇▇▂▁▁▁▁ |
| 0.5 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▁▁▁▆ |
| 0.22 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
| 0.5 | 0 | 0 | 0 | 1 | 1 | ▇▁▁▁▁▁▁▆ |
| 0.24 | 0 | 0 | 0 | 0 | 1 | ▇▁▁▁▁▁▁▁ |
If we set aside the larger groups of variables, we get three other variables to look at: aspect, elevation, and slope. A quick look shows that elevation may help separate the cover types. There may also be interesting ways to combine these variables.
raw_train %>%
dplyr::select(cover_type, aspect, elevation, slope) %>%
gather(-cover_type, key = key, value = val) %>%
ggplot(aes(cover_type, val, fill = cover_type)) +
geom_boxplot() +
coord_flip() +
facet_grid(~key, scales = "free") +
theme(legend.position="none")A density plot of elevation shows there is overlap but the groups are reasonably separated. This will likely be an important variable for our models.
raw_train %>%
ggplot(aes(x=elevation)) +
geom_density(aes(group=cover_type, color=cover_type, fill=cover_type), alpha=0.3)The first group of variables are the 40 soil types. By plotting the proportion of each cover type contains each soil type we get some obvious clusters. Likely in modeling we will want to group the soil types. Initial insights show that Krumholz is mostly in soil types 30 through 40. Ponderosa Pine and Cottonwood Willow are 0 through 20.
Soil types present in a cell can help eliminate the options for cover type in our model.
train_soil <- raw_train %>%
dplyr::select(cover_type, contains("soil")) %>%
gather(-cover_type, key = soil, value = val) %>%
mutate(soil = parse_number(soil), n=1)
train_soil %>%
group_by(cover_type, soil) %>%
summarise(total = sum(val),
prop = total / sum(n)) %>%
ggplot(aes(soil, prop, fill = cover_type)) +
geom_col() +
facet_grid(~cover_type) +
theme(axis.text.x = element_text(size = 6, angle=45),
legend.position="none") +
coord_flip()Looking further into the soil types we can see that cover types usually have mostly soil types over or under 20, not both, except for aspen which is spread across the spectrum.
train_soil %>%
dplyr::filter(val > 0) %>%
ggplot(aes(cover_type, val, fill = soil)) +
geom_col(position="fill") +
coord_flip()The next grouping of variables is wilderness area. Again we see some clear delineations aong the cover types. Cottonwood Willow is only in area four. Ponderosa Pine and Douglasfir have three and four, whereas the reaining cover types are primarily three and one.
train_wild <- raw_train %>%
dplyr::select(cover_type, contains("wilderness")) %>%
gather(-cover_type, key = wild, value = val) %>%
mutate(wild = parse_number(wild), n=1)
train_wild %>%
group_by(cover_type, wild) %>%
summarise(total = sum(val),
prop = total / sum(n)) %>%
ggplot(aes(wild, prop, fill = cover_type)) +
geom_col() +
facet_grid(~cover_type) +
theme(axis.text.x = element_text(size = 6, angle=45),
legend.position="none") +
coord_flip()Another view of the proportions confirms this.
train_wild %>%
dplyr::filter(val > 0) %>%
ggplot(aes(cover_type, val, fill = wild)) +
geom_col(position="fill") +
coord_flip()Interestingly wilderness areas three and one are negatively correlated. This may be useful for our models.
raw_train %>%
dplyr::select(contains("wilderness")) %>%
cor() %>%
melt() %>%
mutate(v1 = parse_number(Var1),
v2 = parse_number(Var2)) %>%
ggplot(aes(v1, v2, fill = value)) +
geom_tile()Combining soil types and wilderness areas into one plot gives us some obvious clusters in the training data. Further investigation of how to best group these variables will be useful for the modeling phase.
raw_train %>%
dplyr::select(cover_type, contains("soil"), contains("wilderness")) %>%
mutate(n=1) %>%
gather(contains("soil"), key = soil, value = soil_val) %>%
gather(contains("wilderness"), key = wild, value = wild_val) %>%
dplyr::filter(soil_val > 0, wild_val > 0) %>%
group_by(cover_type, soil, wild) %>%
summarise(n = sum(n)) %>%
ungroup() %>%
group_by(cover_type) %>%
mutate(prop = n / sum(n)) %>%
mutate(soil = parse_number(soil),
wild = parse_number(wild)) %>%
ungroup() %>%
ggplot(aes(wild, soil, color=cover_type, size = prop)) +
geom_point() +
theme(legend.position="none") +
facet_grid(~cover_type) Shade measurements at various times of day show that some cover types have more variation than others. The pines and Spruce Fir are more likely to have lower shade values (<100) than the other cover types.
raw_train %>%
dplyr::select(cover_type, contains("hillshade")) %>%
gather(-cover_type, key = shade, value = val) %>%
mutate(shade = factor(shade, levels = c("hillshade_9am", "hillshade_noon", "hillshade_3pm"))) %>%
ggplot(aes(cover_type, val)) +
geom_jitter(aes(color = shade), alpha = 0.05, stroke = 0) +
geom_boxplot(aes(fill = shade)) +
coord_flip() +
facet_grid(~shade) +
theme(legend.position="none")Similarly, the distance variables have cutoffs that can help identify a cover type. Lodgepole Pine and Spruce Fir have much wider ranges f distance measurements than the other cover types.
raw_train %>%
dplyr::select(cover_type, contains("distance")) %>%
gather(-cover_type, key = dist, value = val) %>%
ggplot(aes(cover_type, val)) +
geom_jitter(aes(color = dist), alpha = 0.05, stroke = 0) +
geom_boxplot(aes(fill = dist)) +
coord_flip() +
facet_grid(~dist, scales = "free") +
theme(legend.position="none")Looking at the horizontal and vertical distance to hydrology, most cover types have similar, positive correlations. Combining these variables may further help in distinguishing our target variable.
raw_train %>%
dplyr::select(cover_type, contains("hydrology")) %>%
ggplot(aes(horizontal_distance_to_hydrology, vertical_distance_to_hydrology)) +
geom_point(aes(color = cover_type), alpha = 0.01, stroke = 0) +
geom_smooth() +
facet_grid(~cover_type) +
theme(legend.position="none")## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
A preliminary Random forest model was constructed to gain a deeper understanding of the relationships between variables and predictors.
## Create a data.table from data.frame
#cover_data <- as.data.table(raw_train)
#rm(raw_train)
##run model in parallel
#cl <- makeCluster(detectCores())
#registerDoParallel(cl)
## Run a base forest model for EDA
#base_forest <- randomForest(cover_type ~ ., data=cover_data, mtry=sqrt(ncol(cover_data)),
# ntree = 300, importance =T, do.trace=50)
#
#stopCluster(cl)
#base_forest
#saveRDS(base_forest, file = '../cache/base_forest.RData')base_forest <- readRDS('../cache/base_forest.RData')
base_forest##
## Call:
## randomForest(formula = cover_type ~ ., data = cover_data, mtry = sqrt(ncol(cover_data)), ntree = 300, importance = T, do.trace = 50)
## Type of random forest: classification
## Number of trees: 300
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 16.12%
## Confusion matrix:
## spruce_fir lodgepole_pine ponderosa_pine
## spruce_fir 122349 25223 63
## lodgepole_pine 18407 177880 1714
## ponderosa_pine 2 1774 22919
## cottonwood_Willow 0 0 831
## aspen 332 4642 273
## douglasfir 14 2276 5149
## krummholz 3399 83 2
## cottonwood_Willow aspen douglasfir krummholz class.error
## spruce_fir 0 2 10 641 0.17492312
## lodgepole_pine 5 79 151 75 0.10302505
## ponderosa_pine 64 3 266 0 0.08426562
## cottonwood_Willow 1085 0 7 0 0.43577743
## aspen 0 1388 11 0 0.79115257
## douglasfir 55 0 4663 0 0.61643498
## krummholz 0 0 0 10873 0.24266908
The variable importnace plot builds upon what is seen in the EDA, and indicates what features should be pursued further. The plot on the left indicates which variables’ inclusion reduces the classification error rate. The greater the MDA (Mean Decrease in Accuracy) the more important this variable is. The plot on the right indicates which variables contribute to the highest homogeneity in the nodes. The greater the mean decrease in Gini the higher the variables contribution to node purity.
varImpPlot(base_forest)